Fix groupby any/all on null-containing string columns#22926
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughSummary by CodeRabbit
Walkthrough
ChangesGroupBy boolean reduction null mask and dtype fix
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/merge |
…2926) `GroupBy.all` / `GroupBy.any` returned wrong results for string columns containing nulls. Two issues in `GroupBy._bool_reduce`'s string→bool coercion: 1. **Lost missingness for `na_value=np.nan` string dtypes.** `count_characters() > 0` does not propagate the source null (a NaN-flavored missing compares as `False`), so a missing value was treated as a real empty/falsy string. The source column's null positions are now restored on the bool column. 2. **`astype(np.bool_)` on a null-containing extension bool.** Casting a null-containing pandas-extension bool to numpy bool is (intentionally) rejected in pandas-compatible mode. The bool column is now kept nullable through the `min`/`max` aggregation and normalized to `np.bool_` only after empty/`skipna` groups are filled. As a result, `all` over an all-null group is now vacuously `True` and `any` is `False` (with nulls dropped under `skipna=True`), matching pandas across all four `StringDtype` variants (`python`/`pyarrow` × `na_value=pd.NA`/`np.nan`). This fixes `tests/groupby/test_reductions.py::test_string_dtype_all_na` (40 parametrizations) in the `cudf.pandas` pandas test suite, and incidentally `test_any` and `test_groupby_bool_aggs[True-any-vals12]`; their xfail entries are dropped. ### Out of scope `test_masked_kleene_logic[any-False-...]` remains xfailed: a *mixed* True/False/NA group under `skipna=False` requires full three-valued (Kleene) NA propagation (e.g. `False OR NA -> NA`), which is a larger change than the all-null / all-valued cases addressed here. ### Tests Added `test_groupby_any_all_string_nulls` (cuDF unit test) covering `any`/`all` × `skipna` over null and valued string groups across all four `StringDtype` variants. It fails without this change. ### Verification - `pandas-testing/.../test_reductions.py` under `cudf.pandas` (with plugin): 1243 passed, 121 xfailed, 0 failed, 0 XPASS. - Full cuDF groupby unit suite: no regressions. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: rapidsai#22926
Description
GroupBy.all/GroupBy.anyreturned wrong results for string columns containing nulls. Two issues inGroupBy._bool_reduce's string→bool coercion:na_value=np.nanstring dtypes.count_characters() > 0does not propagate the source null (a NaN-flavored missing compares asFalse), so a missing value was treated as a real empty/falsy string. The source column's null positions are now restored on the bool column.astype(np.bool_)on a null-containing extension bool. Casting a null-containing pandas-extension bool to numpy bool is (intentionally) rejected in pandas-compatible mode. The bool column is now kept nullable through themin/maxaggregation and normalized tonp.bool_only after empty/skipnagroups are filled.As a result,
allover an all-null group is now vacuouslyTrueandanyisFalse(with nulls dropped underskipna=True), matching pandas across all fourStringDtypevariants (python/pyarrow×na_value=pd.NA/np.nan).This fixes
tests/groupby/test_reductions.py::test_string_dtype_all_na(40 parametrizations) in thecudf.pandaspandas test suite, and incidentallytest_anyandtest_groupby_bool_aggs[True-any-vals12]; their xfail entries are dropped.Out of scope
test_masked_kleene_logic[any-False-...]remains xfailed: a mixed True/False/NA group underskipna=Falserequires full three-valued (Kleene) NA propagation (e.g.False OR NA -> NA), which is a larger change than the all-null / all-valued cases addressed here.Tests
Added
test_groupby_any_all_string_nulls(cuDF unit test) coveringany/all×skipnaover null and valued string groups across all fourStringDtypevariants. It fails without this change.Verification
pandas-testing/.../test_reductions.pyundercudf.pandas(with plugin): 1243 passed, 121 xfailed, 0 failed, 0 XPASS.Checklist